
    Finding answers to questions, in text collections or web, in open domain or specialty domains

    This chapter is dedicated to factual question answering, i.e. extracting precise and exact answers to questions posed in natural language from texts. A question in natural language gives more information than a bag-of-words query (i.e. a query made of a list of words) and provides clues for finding precise answers. We first present the underlying problems, mainly due to linguistic variations between questions and the passages that can answer them, which affect both the selection of relevant passages and the extraction of reliable answers. We then present how to answer factual questions in open domain. We also present answering questions in specialty domains, which requires dealing with semi-structured knowledge and specialized terminologies and can lead to different applications, such as information management in corporations. Searching for answers on the Web constitutes another application frame and introduces specificities linked to Web redundancy and collaborative usage. The Web is also multilingual, and a challenging problem consists in searching for answers in documents whose language differs from that of the question. For all these topics, we present the main approaches and the remaining problems.
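The two core steps the chapter discusses, passage selection and answer extraction, can be illustrated with a deliberately minimal sketch. The overlap-based ranking and the capitalized-sequence heuristic below are simplifications for illustration, not the chapter's actual methods, and the stop-word set and example passages are invented.

```python
import re

def select_passages(question, passages, top_k=2):
    """Rank passages by overlap with the question's content words (naive: no stemming)."""
    q_words = set(question.lower().split()) - {"who", "what", "when", "the", "did"}
    scored = sorted(passages,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:top_k]

def extract_answer(passage):
    """Naively take the first capitalized multi-word sequence as the candidate answer."""
    m = re.search(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+)", passage)
    return m.group(1) if m else None

passages = [
    "Nicolas Sarkozy succeeded Jacques Chirac as president of France.",
    "France is a country in western Europe.",
]
best = select_passages("Who succeeded Jacques Chirac?", passages)[0]
answer = extract_answer(best)
```

Real systems replace both heuristics with far richer models; the point is only the pipeline shape: rank passages against the question, then extract a typed answer span from the best one.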

    Limsiiles: Basic english substitution for student answer assessment at semeval 2013

    In this paper, we describe a method for assessing student answers, modeled as a paraphrase identification problem and based on substitution with Basic English variants. Basic English paraphrases are acquired from the Simple English Wiktionary. Substitutions are applied to both reference answers and student answers in order to reduce the diversity of their vocabulary and map them to a common vocabulary. The evaluation of our approach on the SemEval 2013 Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge data shows promising results, and this work is a first step toward an open-domain system able to exhibit deep text understanding capabilities.
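The normalization idea can be sketched as follows: map both answers onto a shared Basic English vocabulary, then compare them. The tiny substitution table here is a hypothetical stand-in for one mined from the Simple English Wiktionary, and the Jaccard comparison is a placeholder for the paper's actual assessment step.

```python
# Hypothetical substitution table; the real one is acquired from
# the Simple English Wiktionary.
SUBSTITUTIONS = {
    "purchase": "buy",
    "automobile": "car",
    "acquired": "got",
}

def normalize(tokens):
    """Replace each token by its Basic English variant when one exists."""
    return [SUBSTITUTIONS.get(t.lower(), t.lower()) for t in tokens]

def lexical_overlap(reference, answer):
    """Jaccard overlap of the normalized vocabularies."""
    ref = set(normalize(reference.split()))
    ans = set(normalize(answer.split()))
    return len(ref & ans) / len(ref | ans)

score = lexical_overlap("I purchase an automobile", "I buy a car")
```

Without the substitution step, "purchase"/"buy" and "automobile"/"car" would count as mismatches; mapping both sides to the common vocabulary raises the overlap.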

    A Thematic Segmentation Procedure for Extracting Semantic Domains from Texts

    Thematic analysis is essential for many Natural Language Processing (NLP) applications, such as text summarization or information extraction. It is a two-dimensional process which has both to identify the thematic segments of a text and to recognize the semantic domain concerned by each of them. This second task requires a representation of these domains. Such representations are built in the Information Retrieval and Text Categorization fields by grouping together the words of a set of texts that have been manually linked to the same domain. We claim that this kind of method can only be applied to characterize very general topics. We propose here a method for building representations of narrower semantic domains without any manual intervention. First, we present a procedure for the thematic segmentation of texts which relies on lexical cohesion evaluated from a collocation network. This procedure gives us basic units that are more thematically coherent than a whole text. Then, we show how these units can be aggregated, according to a similarity measure, to build representations of semantic domains in an incremental and unsupervised way.
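The incremental, unsupervised aggregation step can be sketched as follows. Segments are reduced here to bags of words and compared by cosine similarity; each new segment either enriches the most similar existing domain or starts a new one. The similarity measure, threshold, and example segments are illustrative assumptions, not the paper's exact choices.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def aggregate(segments, threshold=0.3):
    """Incrementally merge thematic segments into domain representations."""
    domains = []  # each domain is a Counter over its segments' words
    for seg in segments:
        vec = Counter(seg.split())
        best = max(domains, key=lambda d: cosine(vec, d), default=None)
        if best is not None and cosine(vec, best) >= threshold:
            best.update(vec)      # enrich the existing domain
        else:
            domains.append(vec)   # start a new domain
    return domains

domains = aggregate([
    "court trial judge verdict",
    "judge court appeal verdict",
    "match goal player team",
])
```

The two legal segments merge into one domain while the sport segment opens a second, without any manual labeling.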

    A Unified Kernel Approach For Learning Typed Sentence Rewritings

    Many high-level natural language processing problems can be framed as determining whether two given sentences are rewritings of each other. In this paper, we propose a class of kernel functions, referred to as type-enriched string rewriting kernels, which, used in kernel-based machine learning algorithms, make it possible to learn sentence rewritings. Unlike previous work, this method can be fed external lexical semantic relations to capture a wider class of rewriting rules. It also does not assume preliminary syntactic parsing but still provides a unified framework to capture syntactic structure and alignments between the two sentences. We experiment on three different natural sentence rewriting tasks and obtain state-of-the-art results on all of them.

    Answer type validation in question answering systems

    In open-domain question-answering systems, numerous questions call for answers of an explicit type. For example, the question ``Which president succeeded Jacques Chirac?'' requires an instance of president as the answer. The method we present in this article aims at verifying that an answer given by a system corresponds to the expected type. This verification is done by combining criteria provided by different methods dedicated to checking the appropriateness between an answer and a type: the first criteria are statistical and compute the rate at which the answer and the type co-occur in documents, other criteria rely on named entity recognizers, and the last criteria are based on the use of Wikipedia.
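The statistical criterion can be sketched as a simple co-occurrence rate: among documents mentioning the candidate answer, how many also mention the expected type. The substring matching and the toy documents below are simplifying assumptions for illustration; the article's actual criteria are richer.

```python
def presence_rate(answer, type_label, documents):
    """Fraction of documents mentioning the answer that also mention the type."""
    with_answer = [d for d in documents if answer.lower() in d.lower()]
    if not with_answer:
        return 0.0
    both = sum(1 for d in with_answer if type_label.lower() in d.lower())
    return both / len(with_answer)

docs = [
    "Nicolas Sarkozy was elected president in 2007.",
    "Nicolas Sarkozy succeeded Jacques Chirac as president.",
    "Paris is the capital of France.",
]
rate = presence_rate("Nicolas Sarkozy", "president", docs)
```

A high rate supports the hypothesis that the answer is an instance of the type; a rate near zero is evidence against it, to be combined with the named-entity and Wikipedia criteria.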

    A Framework of Evaluation for Question-Answering Systems

    Evaluating complex systems is a complex task. Evaluation campaigns are organized each year to test different systems on global results, but they do not evaluate the relevance of the criteria used. Our purpose consists in modifying the intermediate results created by the components and inserting the new results into the process, without modifying the components themselves. We describe our framework for glass-box evaluation.

    A Corpus for Hybrid Question Answering Systems

    Question answering has been the focus of much research and many evaluation campaigns, either for text-based systems (the TREC and CLEF evaluation campaigns, for example) or for knowledge-based systems (QALD, BioASQ). Few systems have effectively combined both types of resources and methods in order to exploit the fruitfulness of merging the two kinds of information repositories. The only QA evaluation track that focuses on hybrid QA has been QALD, since 2014. As it is a recent task, few annotated data are available (around 150 questions). In this paper, we present a question answering dataset that was constructed to develop and evaluate hybrid question answering systems. In order to create this corpus, we collected several textual corpora and augmented them with entities and relations of a knowledge base by retrieving paths in the knowledge base that allow the questions to be answered. The resulting corpus contains 4300 question-answer pairs, 1600 of which have a true link with DBpedia.
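The path-retrieval step used to augment the corpus can be sketched as a bounded breadth-first search in a triple store: given a question entity and the known answer, collect the relation paths connecting them. The toy triples and the two-hop limit below are illustrative assumptions standing in for DBpedia and the paper's actual retrieval procedure.

```python
# Toy triple store (subject, predicate, object); a stand-in for DBpedia.
TRIPLES = [
    ("Barack_Obama", "birthPlace", "Honolulu"),
    ("Honolulu", "country", "United_States"),
    ("Barack_Obama", "party", "Democratic_Party"),
]

def find_paths(entity, answer, max_hops=2):
    """Return relation paths of length <= max_hops from entity to answer."""
    paths = []
    frontier = [(entity, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for s, p, o in TRIPLES:
                if s == node:
                    if o == answer:
                        paths.append(path + [p])
                    next_frontier.append((o, path + [p]))
        frontier = next_frontier
    return paths

paths = find_paths("Barack_Obama", "United_States")
```

For a question like "In which country was Barack Obama born?", the retrieved path (birthPlace, country) records how the knowledge base supports the textual answer, which is exactly the kind of link annotated in the corpus.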

    A hierarchical taxonomy for classifying hardness of inference tasks

    Exhibiting inferential capabilities is one of the major goals of many modern Natural Language Processing systems. However, while attempts have been made to define what textual inferences are, few seek to classify inference phenomena by difficulty. In this paper we propose a hierarchical taxonomy for inferences, relative to their hardness, with corpus annotation and system design and evaluation in mind. Indeed, a fine-grained assessment of the difficulty of a task allows us to design more appropriate systems and to evaluate them only on what they are designed to handle. Each of the seven classes is described and provided with examples from different tasks such as question answering, textual entailment and coreference resolution. We then test the classes of our hierarchy on the specific task of question answering. Our annotation of the test data from the QA4MRE 2013 evaluation campaign shows that it is possible to quantify the contrasts in types of difficulty on datasets of the same task.

    A lexical access aid system: finding the word on the tip of your tongue

    The study of the Tip of the Tongue (TOT) phenomenon provides valuable clues and insights concerning the organisation of the mental lexicon (meaning, number of syllables, relation with other words, etc.). This paper describes a tool based on psycholinguistic observations concerning the TOT phenomenon. We built it to enable a speaker or writer to find the word he is looking for, a word he may know but is unable to access in time. We try to simulate the TOT phenomenon by creating a situation where the system knows the target word yet is unable to access it. In order to find the target word, we make use of the paradigmatic and syntagmatic associations stored in linguistic databases. Our experiment allows the following conclusion: a tool like SVETLAN, capable of automatically structuring a dictionary by domains, can be used successfully to help the speaker or writer find the word he is looking for, if it is combined with a database rich in paradigmatic links such as EuroWordNet.

    Generating a training corpus for OCR post-correction using encoder-decoder model

    In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
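The role of the character-based language model can be illustrated with a much simpler stand-in: the paper uses biLSTMs, but a character-bigram model with add-alpha smoothing already shows how candidate corrections of a garbled OCR token are scored. The training sentence, the candidate list, and the smoothing constants below are illustrative assumptions only.

```python
from collections import Counter
from math import log

def train_char_bigram(text):
    """Count character bigrams and unigrams from a small clean corpus."""
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    return pairs, unigrams

def score(word, model, alpha=1.0, vocab=27):
    """Log-probability of a word's character sequence (add-alpha smoothing;
    vocab=27 is an assumed alphabet size for the smoothing denominator)."""
    pairs, unigrams = model
    s = 0.0
    for a, b in zip(word, word[1:]):
        s += log((pairs[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
    return s

model = train_char_bigram("the patient received the treatment the same day")
candidates = ["the", "tbe"]  # "tbe" is a typical OCR confusion of "the"
best = max(candidates, key=lambda w: score(w, model))
```

The model prefers "the" because its character transitions are frequent in the clean training text, while "tbe" contains bigrams it has never seen; a biLSTM plays the same role with far more context.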